German Compound Analysis with wfsc

Author

  • Anne Schiller
Abstract

Compounding is a very productive process in German for forming complex nouns and adjectives, which represent about 7% of the words of a newspaper text. Unlike English compounds, German compounds do not contain spaces or other word boundaries, and their automatic analysis is often ambiguous. A (non-weighted) finite-state morphological analyzer provides all potential segmentations of a compound without any filtering or prioritization of the results. The paper presents an experiment in analyzing German compounds with the Xerox Weighted Finite-State Compiler (wfsc). The model is based on weights for compound segments and gives priority (a) to compounds with the minimal number of segments and (b) to compound segments with the highest frequency in a training list. The results obtained with this rather simple model show the advantage of weighted finite-state transducers over simple FSTs.

1 Compound Construction

A very productive word formation process in German is compounding, which combines words to build more complex words, mainly nouns or adjectives. In a large newspaper corpus, the Xerox German Morphological Analyzer [1] identified 5.5% of the 9.3 million tokens and 43% of the 420,000 types overall as compounds. In other texts, such as technical manuals, the percentage of compound tokens may be even higher (e.g. 12% in a short printer manual). This is comparable to the observations of Baroni et al. [2], who found in a newswire corpus of 28 million tokens that 7% of the tokens and 46% of the types were compounds.

[Footnote 1] Tokens represent all words of the text; types count only different word forms.

[Footnote 2] Verbal compounds (such as spazierengehen) exist, but are much less productive than nouns or adjectives. They are not taken into account in this experiment.

Regarding the construction of compounds, any adjective or noun (including proper names) may, in principle, appear as the head word, and any adjective, noun or verb may occur as the left-hand ("modifier") part of a compound.
– Buchseite (book page)
– Großstadt (big town)
– grasgrün (grass green)
– Goethestück (Goethe piece)

The Xerox finite-state tool [3] for German morphological analysis [1] implements this general principle without any semantic restrictions and with only a few morphosyntactic constraints concerning the so-called "linking" elements. The advantage of this very general approach is high and robust coverage. The drawback is potentially very ambiguous output due to "over-segmentation" of (long) words. A potential source of over-segmentation is the homonymy of compound parts with derivational affixes:

– derivational suffix -ei (-ing) vs. noun Ei (egg), e.g. Spielerei (playing) vs. Vogel#ei (bird egg)
– prefix ein- (in-) vs. cardinal ein (one), e.g. Einwohner#zahl (inhabitant number) vs. Ein#zimmer#wohnung (one room
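The contrast between unweighted and weighted analysis can be illustrated with a small Python sketch. It is not the wfsc implementation and does not reproduce the paper's actual weights: the toy lexicon, its frequencies, and the function names `segmentations`, `cost` and `best_segmentation` are all assumptions made for illustration, and linking elements are ignored. The first function returns every segmentation, as an unweighted analyzer would; the ranking then prefers (a) fewer segments and (b) more frequent segments, here approximated by a sum of negative log relative frequencies.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon with invented training frequencies (not from the paper).
LEXICON: Dict[str, int] = {
    "vogel": 50, "ei": 20, "spielerei": 5, "spieler": 30,
    "ein": 100, "zimmer": 40, "wohnung": 60, "einwohner": 25, "zahl": 45,
}


def segmentations(word: str, lexicon: Dict[str, int]) -> List[List[str]]:
    """Return every split of `word` into lexicon entries (no filtering)."""
    word = word.lower()
    if not word:
        return [[]]  # exactly one way to segment the empty remainder
    results: List[List[str]] = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([head] + rest)
    return results


def cost(segments: List[str], lexicon: Dict[str, int]) -> Tuple[int, float]:
    """Sort key: segment count first, then summed -log relative frequency."""
    total = sum(lexicon.values())
    freq_cost = sum(-math.log(lexicon[s] / total) for s in segments)
    return (len(segments), freq_cost)


def best_segmentation(word: str, lexicon: Dict[str, int]) -> Optional[List[str]]:
    """Pick the lowest-cost segmentation, or None if the word has no analysis."""
    candidates = segmentations(word, lexicon)
    if not candidates:
        return None
    return min(candidates, key=lambda segs: cost(segs, lexicon))


if __name__ == "__main__":
    for w in ["Spielerei", "Vogelei", "Einzimmerwohnung"]:
        print(w, segmentations(w, LEXICON), "->", best_segmentation(w, LEXICON))
```

With this toy lexicon, Spielerei receives two analyses (spieler#ei and the unsegmented spielerei), and the segment-count criterion selects the unsegmented reading, while Vogelei has only the two-segment analysis vogel#ei. Comparing a tuple key lexicographically is a simplification; a weighted transducer would typically encode both criteria in a single path weight.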

Publication date: 2005